Objective:
The goal of this binary classification task is to predict whether a patient has Parkinson's disease based on voice recording data, acting as an effective screening step that helps doctors decide whether to refer a patient for further diagnostics by a clinician.
Executive Summary:
Parkinson's disease is a brain disorder that leads to shaking, stiffness, and difficulty with walking, balance, and coordination. There is commonly a marked effect on speech, including dysarthria (difficulty articulating sounds), hypophonia (lowered volume), and monotone (reduced pitch range). Additionally, cognitive impairments and changes in mood can occur, and the risk of dementia is increased. Since the disease markedly affects the voice, we aim to use a patient's voice recordings to predict whether they have Parkinson's disease; this would act as an effective, non-invasive screening step to identify at-risk patients before a clinic visit is required for diagnosis.
The dataset was created by Max Little of the University of Oxford, in collaboration with the National Centre for Voice and Speech, Denver, Colorado, who recorded the speech signals. It is composed of a range of biomedical voice measurements from 31 people, 23 of whom have Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD.
# install additional required libs
# !pip install pandas-profiling
# !pip install imblearn
# !pip install xgboost
# !pip install catboost
# !pip install lightgbm
# (or to install conda packages with anaconda or miniconda)
# !conda install -c conda-forge pandas-profiling
# !conda install -c conda-forge imbalanced-learn
# !conda install -c conda-forge xgboost
# !conda install -c conda-forge catboost
# !conda install -c conda-forge lightgbm
import os, time
import platform, warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas_profiling import ProfileReport
sns.set_style('darkgrid')
%matplotlib inline
# warnings.filterwarnings('ignore')
print(f'Py: {platform.python_version()}')
Py: 3.8.5
We go through the machine learning pipeline: we start by reading the dataset and exploring it through plots and summaries; then we preprocess the data, standardizing the features and checking for missing values; next we build models to classify the data.
Finally, we evaluate the best models on the held-out test dataset.
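As a sketch of the train/test split that makes this held-out evaluation possible, assuming scikit-learn is available (the arrays below are synthetic stand-ins for the real feature matrix and status labels, not the actual data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy stand-in for a 195-row feature matrix and imbalanced binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(195, 22))
y = np.array([1] * 150 + [0] * 45)

# stratify on y so both classes keep their proportions in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```

Stratification matters here because with an imbalanced target, a plain random split can leave the test set with too few negative examples to evaluate on.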
# read in the dataset
dataset_path = './Data - Parkinsons.csv'
if not os.path.exists(dataset_path):
    raise FileNotFoundError(f'File not found at {dataset_path}')
df = pd.read_csv(dataset_path)
df.sample(7)
| | name | MDVP:Fo(Hz) | MDVP:Fhi(Hz) | MDVP:Flo(Hz) | MDVP:Jitter(%) | MDVP:Jitter(Abs) | MDVP:RAP | MDVP:PPQ | Jitter:DDP | MDVP:Shimmer | ... | Shimmer:DDA | NHR | HNR | status | RPDE | DFA | spread1 | spread2 | D2 | PPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | phon_R01_S01_1 | 119.992 | 157.302 | 74.997 | 0.00784 | 0.00007 | 0.00370 | 0.00554 | 0.01109 | 0.04374 | ... | 0.06545 | 0.02211 | 21.033 | 1 | 0.414783 | 0.815285 | -4.813031 | 0.266482 | 2.301442 | 0.284654 |
| 153 | phon_R01_S37_1 | 121.345 | 139.644 | 98.250 | 0.00684 | 0.00006 | 0.00388 | 0.00332 | 0.01164 | 0.02534 | ... | 0.04019 | 0.04179 | 21.520 | 1 | 0.566867 | 0.670475 | -4.865194 | 0.246404 | 2.013530 | 0.168581 |
| 97 | phon_R01_S24_1 | 125.036 | 143.946 | 116.187 | 0.01280 | 0.00010 | 0.00743 | 0.00623 | 0.02228 | 0.03886 | ... | 0.06406 | 0.08151 | 15.338 | 1 | 0.629574 | 0.714485 | -4.020042 | 0.265315 | 2.671825 | 0.340623 |
| 13 | phon_R01_S04_2 | 139.173 | 179.139 | 76.556 | 0.00390 | 0.00003 | 0.00165 | 0.00208 | 0.00496 | 0.01642 | ... | 0.02184 | 0.01041 | 24.889 | 1 | 0.430166 | 0.665833 | -5.660217 | 0.254989 | 2.519422 | 0.199889 |
| 96 | phon_R01_S22_6 | 159.116 | 168.913 | 144.811 | 0.00342 | 0.00002 | 0.00178 | 0.00184 | 0.00535 | 0.03381 | ... | 0.05417 | 0.00852 | 22.663 | 1 | 0.366329 | 0.693429 | -6.417440 | 0.194627 | 2.473239 | 0.151709 |
| 40 | phon_R01_S08_5 | 186.163 | 197.724 | 177.584 | 0.00298 | 0.00002 | 0.00165 | 0.00175 | 0.00496 | 0.01495 | ... | 0.02321 | 0.00231 | 26.822 | 1 | 0.326480 | 0.765623 | -6.647379 | 0.201095 | 2.374073 | 0.130554 |
| 28 | phon_R01_S06_5 | 155.358 | 227.383 | 80.055 | 0.00310 | 0.00002 | 0.00159 | 0.00176 | 0.00476 | 0.01718 | ... | 0.02307 | 0.00677 | 25.970 | 1 | 0.470478 | 0.676258 | -7.120925 | 0.279789 | 2.241742 | 0.108514 |
7 rows × 24 columns
df.shape
(195, 24)
Looking at the dataset from a data-completeness perspective, it has only 195 rows, which will likely limit machine learning performance; we would need more data for the model to generalize well to out-of-sample data and for our results or insights to be applicable in the real world. The values of the "name" variable are all unique and add no predictive information, so we can drop this attribute before modelling. Also, the numerical features have very different ranges, so we should scale the data before fitting models so that variables with larger scales do not dominate the model.
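The scaling step can be sketched with scikit-learn's `StandardScaler`; the two synthetic columns below only mimic the scale gap between features like MDVP:Fo(Hz) (order 1e2) and MDVP:Jitter(Abs) (order 1e-5):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# two toy columns on wildly different scales
X = np.column_stack([rng.normal(150.0, 40.0, 100),
                     rng.normal(4e-5, 3e-5, 100)])

# standardize: subtract each column's mean, divide by its std
X_scaled = StandardScaler().fit_transform(X)
# after scaling, every column has mean ~0 and standard deviation ~1
```

In practice the scaler should be fit on the training split only and then applied to the test split, to avoid leaking test-set statistics into training.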
The target column is status (categorical), and there appears to be a class imbalance: there are more positive examples (status = 1) than negative ones. So we might need to employ an oversampling technique such as SMOTE to balance the dataset. Overall, this is a binary classification problem, where the machine learning model will try to predict whether each row's status is 0 or 1.
Challenges:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   name              195 non-null    object
 1   MDVP:Fo(Hz)       195 non-null    float64
 2   MDVP:Fhi(Hz)      195 non-null    float64
 3   MDVP:Flo(Hz)      195 non-null    float64
 4   MDVP:Jitter(%)    195 non-null    float64
 5   MDVP:Jitter(Abs)  195 non-null    float64
 6   MDVP:RAP          195 non-null    float64
 7   MDVP:PPQ          195 non-null    float64
 8   Jitter:DDP        195 non-null    float64
 9   MDVP:Shimmer      195 non-null    float64
 10  MDVP:Shimmer(dB)  195 non-null    float64
 11  Shimmer:APQ3      195 non-null    float64
 12  Shimmer:APQ5      195 non-null    float64
 13  MDVP:APQ          195 non-null    float64
 14  Shimmer:DDA       195 non-null    float64
 15  NHR               195 non-null    float64
 16  HNR               195 non-null    float64
 17  status            195 non-null    int64
 18  RPDE              195 non-null    float64
 19  DFA               195 non-null    float64
 20  spread1           195 non-null    float64
 21  spread2           195 non-null    float64
 22  D2                195 non-null    float64
 23  PPE               195 non-null    float64
dtypes: float64(22), int64(1), object(1)
memory usage: 36.7+ KB
# pandas inferred the wrong dtype for the status attribute, so cast it to category
df = df.astype({"status":'category'})
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   name              195 non-null    object
 1   MDVP:Fo(Hz)       195 non-null    float64
 2   MDVP:Fhi(Hz)      195 non-null    float64
 3   MDVP:Flo(Hz)      195 non-null    float64
 4   MDVP:Jitter(%)    195 non-null    float64
 5   MDVP:Jitter(Abs)  195 non-null    float64
 6   MDVP:RAP          195 non-null    float64
 7   MDVP:PPQ          195 non-null    float64
 8   Jitter:DDP        195 non-null    float64
 9   MDVP:Shimmer      195 non-null    float64
 10  MDVP:Shimmer(dB)  195 non-null    float64
 11  Shimmer:APQ3      195 non-null    float64
 12  Shimmer:APQ5      195 non-null    float64
 13  MDVP:APQ          195 non-null    float64
 14  Shimmer:DDA       195 non-null    float64
 15  NHR               195 non-null    float64
 16  HNR               195 non-null    float64
 17  status            195 non-null    category
 18  RPDE              195 non-null    float64
 19  DFA               195 non-null    float64
 20  spread1           195 non-null    float64
 21  spread2           195 non-null    float64
 22  D2                195 non-null    float64
 23  PPE               195 non-null    float64
dtypes: category(1), float64(22), object(1)
memory usage: 35.5+ KB
Attribute Information:
# checking for missing values
df.isna().sum()
name                0
MDVP:Fo(Hz)         0
MDVP:Fhi(Hz)        0
MDVP:Flo(Hz)        0
MDVP:Jitter(%)      0
MDVP:Jitter(Abs)    0
MDVP:RAP            0
MDVP:PPQ            0
Jitter:DDP          0
MDVP:Shimmer        0
MDVP:Shimmer(dB)    0
Shimmer:APQ3        0
Shimmer:APQ5        0
MDVP:APQ            0
Shimmer:DDA         0
NHR                 0
HNR                 0
status              0
RPDE                0
DFA                 0
spread1             0
spread2             0
D2                  0
PPE                 0
dtype: int64
We confirm that there are no missing values (NAs), so we do not need to remove or impute anything. If there were missing values, we could impute them using column medians, or perhaps KNN imputation based on each data point's nearest neighbors. From this point of view, the data looks great.
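Both imputation strategies mentioned above are available in scikit-learn; a minimal sketch on a tiny synthetic matrix (not the Parkinson's data, which has no NAs):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0,    2.0],
              [np.nan, 3.0],
              [7.0,    6.0],
              [4.0,    np.nan]])

# median imputation: each NaN is replaced by its column's median
X_median = SimpleImputer(strategy='median').fit_transform(X)

# KNN imputation: each NaN is filled from the nearest rows' values
# (distance computed on the features that are observed)
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```

Median imputation is robust and cheap; KNN imputation can exploit correlations between features at the cost of more computation.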
df.nunique() # unique values counts
name                195
MDVP:Fo(Hz)         195
MDVP:Fhi(Hz)        195
MDVP:Flo(Hz)        195
MDVP:Jitter(%)      173
MDVP:Jitter(Abs)     19
MDVP:RAP            155
MDVP:PPQ            165
Jitter:DDP          180
MDVP:Shimmer        188
MDVP:Shimmer(dB)    149
Shimmer:APQ3        184
Shimmer:APQ5        189
MDVP:APQ            189
Shimmer:DDA         189
NHR                 185
HNR                 195
status                2
RPDE                195
DFA                 195
spread1             195
spread2             194
D2                  195
PPE                 195
dtype: int64
We can see that the variables MDVP:Fo(Hz), MDVP:Fhi(Hz), MDVP:Flo(Hz), HNR, RPDE, DFA, spread1, D2, PPE all have unique values for each data point.
df['MDVP:Jitter(Abs)'].value_counts()
0.000030    46
0.000040    28
0.000020    28
0.000010    20
0.000050    17
0.000060    16
0.000080     9
0.000070     8
0.000090     5
0.000009     5
0.000100     3
0.000150     2
0.000110     2
0.000140     1
0.000120     1
0.000220     1
0.000007     1
0.000260     1
0.000160     1
Name: MDVP:Jitter(Abs), dtype: int64
We can see that the variable MDVP:Jitter(Abs) takes only 19 unique values and appears to follow a skewed, discrete-looking distribution, with most of its mass at small values and a long right tail.
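One way to quantify that impression is pandas' sample skewness; a small sketch on a toy series echoing the shape of the value counts above (the counts here are illustrative, not the full column):

```python
import pandas as pd

# toy right-skewed series: mass concentrated at small values, long right tail
s = pd.Series([0.00003] * 46 + [0.00004] * 28 + [0.00008] * 9 + [0.00026])

print(s.nunique())     # 4 distinct values in this toy example
print(s.skew() > 0)    # True -> right-skewed
```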
# summary statistics (five-number summary plus mean and std) for the dataset
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| MDVP:Fo(Hz) | 195.0 | 154.228641 | 41.390065 | 88.333000 | 117.572000 | 148.790000 | 182.769000 | 260.105000 |
| MDVP:Fhi(Hz) | 195.0 | 197.104918 | 91.491548 | 102.145000 | 134.862500 | 175.829000 | 224.205500 | 592.030000 |
| MDVP:Flo(Hz) | 195.0 | 116.324631 | 43.521413 | 65.476000 | 84.291000 | 104.315000 | 140.018500 | 239.170000 |
| MDVP:Jitter(%) | 195.0 | 0.006220 | 0.004848 | 0.001680 | 0.003460 | 0.004940 | 0.007365 | 0.033160 |
| MDVP:Jitter(Abs) | 195.0 | 0.000044 | 0.000035 | 0.000007 | 0.000020 | 0.000030 | 0.000060 | 0.000260 |
| MDVP:RAP | 195.0 | 0.003306 | 0.002968 | 0.000680 | 0.001660 | 0.002500 | 0.003835 | 0.021440 |
| MDVP:PPQ | 195.0 | 0.003446 | 0.002759 | 0.000920 | 0.001860 | 0.002690 | 0.003955 | 0.019580 |
| Jitter:DDP | 195.0 | 0.009920 | 0.008903 | 0.002040 | 0.004985 | 0.007490 | 0.011505 | 0.064330 |
| MDVP:Shimmer | 195.0 | 0.029709 | 0.018857 | 0.009540 | 0.016505 | 0.022970 | 0.037885 | 0.119080 |
| MDVP:Shimmer(dB) | 195.0 | 0.282251 | 0.194877 | 0.085000 | 0.148500 | 0.221000 | 0.350000 | 1.302000 |
| Shimmer:APQ3 | 195.0 | 0.015664 | 0.010153 | 0.004550 | 0.008245 | 0.012790 | 0.020265 | 0.056470 |
| Shimmer:APQ5 | 195.0 | 0.017878 | 0.012024 | 0.005700 | 0.009580 | 0.013470 | 0.022380 | 0.079400 |
| MDVP:APQ | 195.0 | 0.024081 | 0.016947 | 0.007190 | 0.013080 | 0.018260 | 0.029400 | 0.137780 |
| Shimmer:DDA | 195.0 | 0.046993 | 0.030459 | 0.013640 | 0.024735 | 0.038360 | 0.060795 | 0.169420 |
| NHR | 195.0 | 0.024847 | 0.040418 | 0.000650 | 0.005925 | 0.011660 | 0.025640 | 0.314820 |
| HNR | 195.0 | 21.885974 | 4.425764 | 8.441000 | 19.198000 | 22.085000 | 25.075500 | 33.047000 |
| RPDE | 195.0 | 0.498536 | 0.103942 | 0.256570 | 0.421306 | 0.495954 | 0.587562 | 0.685151 |
| DFA | 195.0 | 0.718099 | 0.055336 | 0.574282 | 0.674758 | 0.722254 | 0.761881 | 0.825288 |
| spread1 | 195.0 | -5.684397 | 1.090208 | -7.964984 | -6.450096 | -5.720868 | -5.046192 | -2.434031 |
| spread2 | 195.0 | 0.226510 | 0.083406 | 0.006274 | 0.174351 | 0.218885 | 0.279234 | 0.450493 |
| D2 | 195.0 | 2.381826 | 0.382799 | 1.423287 | 2.099125 | 2.361532 | 2.636456 | 3.671155 |
| PPE | 195.0 | 0.206552 | 0.090119 | 0.044539 | 0.137451 | 0.194052 | 0.252980 | 0.527367 |
profile = ProfileReport(df, title='Pandas Profiling Report', explorative=True)
profile